
Summary
This project develops a model to estimate the probability that a customer will leave the company (churn). This matters because it allows the company to identify at-risk customers and, for example, offer them a better-suited plan to keep them as clients.
About 15% of the customers in the dataset are churners.
In the end, a classification model (XGBoost) allowed us to predict with roughly 98% accuracy whether a customer is likely to leave the current telecom company.
This competition is about predicting whether a customer will change telecommunications provider, something known as churn.
The training dataset contains 4250 samples. Each sample has 19 input features and 1 boolean target variable, churn, which indicates the class of the sample. The 19 input features and the target variable can be seen in the data preview below.
import warnings
warnings.simplefilter(action = 'ignore', category = FutureWarning)
import pandas as pd
import numpy as np
# Plotting
import matplotlib
from matplotlib import transforms, pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from scipy.stats import shapiro
# Train model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.compose import make_column_transformer
from imblearn.over_sampling import SMOTE, ADASYN, SMOTENC
# define colors
GRAY1, GRAY2, GRAY3 = '#231F20', '#414040', '#555655'
GRAY4, GRAY5, GRAY6 = '#646369', '#76787B', '#828282'
GRAY7, GRAY8, GRAY9 = '#929497', '#A6A6A5', '#BFBEBE'
BLUE1, BLUE2, BLUE3, BLUE4 = '#174A7E', '#4A81BF', '#94B2D7', '#94AFC5'
RED1, RED2 = '#C3514E', '#E6BAB7'
GREEN1, GREEN2 = '#0C8040', '#9ABB59'
ORANGE1 = '#F79747'
data = pd.read_csv('data/train.csv', delimiter=',')
data.head()
| state | account_length | area_code | international_plan | voice_mail_plan | number_vmail_messages | total_day_minutes | total_day_calls | total_day_charge | total_eve_minutes | total_eve_calls | total_eve_charge | total_night_minutes | total_night_calls | total_night_charge | total_intl_minutes | total_intl_calls | total_intl_charge | number_customer_service_calls | churn | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | OH | 107 | area_code_415 | no | yes | 26 | 161.6 | 123 | 27.47 | 195.5 | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.70 | 1 | no |
| 1 | NJ | 137 | area_code_415 | no | no | 0 | 243.4 | 114 | 41.38 | 121.2 | 110 | 10.30 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | no |
| 2 | OH | 84 | area_code_408 | yes | no | 0 | 299.4 | 71 | 50.90 | 61.9 | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | no |
| 3 | OK | 75 | area_code_415 | yes | no | 0 | 166.7 | 113 | 28.34 | 148.3 | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | no |
| 4 | MA | 121 | area_code_510 | no | yes | 24 | 218.2 | 88 | 37.09 | 348.5 | 108 | 29.62 | 212.6 | 118 | 9.57 | 7.5 | 7 | 2.03 | 3 | no |
The train data has 4250 rows and 20 variables; churn is the binary variable we want to predict on the test data.
NA values in data?
data.isna().sum()
state 0 account_length 0 area_code 0 international_plan 0 voice_mail_plan 0 number_vmail_messages 0 total_day_minutes 0 total_day_calls 0 total_day_charge 0 total_eve_minutes 0 total_eve_calls 0 total_eve_charge 0 total_night_minutes 0 total_night_calls 0 total_night_charge 0 total_intl_minutes 0 total_intl_calls 0 total_intl_charge 0 number_customer_service_calls 0 churn 0 dtype: int64
Null values in data? (`isnull` is an alias of `isna`, so the result is identical.)
data.isnull().sum()
state 0 account_length 0 area_code 0 international_plan 0 voice_mail_plan 0 number_vmail_messages 0 total_day_minutes 0 total_day_calls 0 total_day_charge 0 total_eve_minutes 0 total_eve_calls 0 total_eve_charge 0 total_night_minutes 0 total_night_calls 0 total_night_charge 0 total_intl_minutes 0 total_intl_calls 0 total_intl_charge 0 number_customer_service_calls 0 churn 0 dtype: int64
Duplicated samples?
data.duplicated().sum()
0
There are no NA values, null values or duplicated entries.
Let's split the data into train (80%), validation (10%) and test (10%) sets so that no information leaks into the model. Because we don't have a large amount of data, we keep 80% for training.
X = data.drop('churn', axis = 1)
y = data.churn
# Stratify on the target and fix the seed so the split is reproducible
X_train, X_testval, y_train, y_testval = train_test_split(X, y,
                                                          test_size = 0.2,
                                                          shuffle = True,
                                                          stratify = y,
                                                          random_state = 42)
X_test, X_val, y_test, y_val = train_test_split(X_testval, y_testval,
                                                test_size = 0.5,
                                                shuffle = True,
                                                stratify = y_testval,
                                                random_state = 42)
print(f'Train shape: {X_train.shape}')
print(f'Validation shape: {X_val.shape}')
print(f'Test shape: {X_test.shape}')
print(f'Sum of rows is ok: {X_train.shape[0] + X_val.shape[0] + X_test.shape[0] == data.shape[0] }')
Train shape: (3400, 19) Validation shape: (425, 19) Test shape: (425, 19) Sum of rows is ok: True
We will have 3400 entries to train the model, 425 to optimize it by tuning hyperparameters, and 425 for the final evaluation of model performance.
X_train.head()
| state | account_length | area_code | international_plan | voice_mail_plan | number_vmail_messages | total_day_minutes | total_day_calls | total_day_charge | total_eve_minutes | total_eve_calls | total_eve_charge | total_night_minutes | total_night_calls | total_night_charge | total_intl_minutes | total_intl_calls | total_intl_charge | number_customer_service_calls | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4056 | MI | 62 | area_code_408 | no | no | 0 | 119.3 | 94 | 20.28 | 224.0 | 81 | 19.04 | 156.7 | 78 | 7.05 | 12.1 | 6 | 3.27 | 5 |
| 4017 | CO | 97 | area_code_415 | no | yes | 27 | 125.9 | 90 | 21.40 | 123.4 | 91 | 10.49 | 230.0 | 71 | 10.35 | 7.3 | 7 | 1.97 | 0 |
| 2819 | VT | 60 | area_code_415 | no | no | 0 | 193.9 | 118 | 32.96 | 85.0 | 110 | 7.23 | 210.1 | 134 | 9.45 | 13.2 | 8 | 3.56 | 3 |
| 1069 | MT | 74 | area_code_415 | no | no | 0 | 162.7 | 102 | 27.66 | 292.0 | 105 | 24.82 | 183.3 | 80 | 8.25 | 8.7 | 6 | 2.35 | 0 |
| 344 | AZ | 117 | area_code_408 | no | no | 0 | 239.9 | 84 | 40.78 | 174.8 | 106 | 14.86 | 209.5 | 93 | 9.43 | 9.8 | 2 | 2.65 | 0 |
First we simplify the area_code variable, keeping only the 3 digits.
X_train2 = X_train.copy()
X_train2['area_code'] = X_train['area_code'].apply(lambda x: x.split('_')[-1])
X_train2.head()
| state | account_length | area_code | international_plan | voice_mail_plan | number_vmail_messages | total_day_minutes | total_day_calls | total_day_charge | total_eve_minutes | total_eve_calls | total_eve_charge | total_night_minutes | total_night_calls | total_night_charge | total_intl_minutes | total_intl_calls | total_intl_charge | number_customer_service_calls | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4056 | MI | 62 | 408 | no | no | 0 | 119.3 | 94 | 20.28 | 224.0 | 81 | 19.04 | 156.7 | 78 | 7.05 | 12.1 | 6 | 3.27 | 5 |
| 4017 | CO | 97 | 415 | no | yes | 27 | 125.9 | 90 | 21.40 | 123.4 | 91 | 10.49 | 230.0 | 71 | 10.35 | 7.3 | 7 | 1.97 | 0 |
| 2819 | VT | 60 | 415 | no | no | 0 | 193.9 | 118 | 32.96 | 85.0 | 110 | 7.23 | 210.1 | 134 | 9.45 | 13.2 | 8 | 3.56 | 3 |
| 1069 | MT | 74 | 415 | no | no | 0 | 162.7 | 102 | 27.66 | 292.0 | 105 | 24.82 | 183.3 | 80 | 8.25 | 8.7 | 6 | 2.35 | 0 |
| 344 | AZ | 117 | 408 | no | no | 0 | 239.9 | 84 | 40.78 | 174.8 | 106 | 14.86 | 209.5 | 93 | 9.43 | 9.8 | 2 | 2.65 | 0 |
Replace yes and no by 1 and 0, respectively.
X_train3 = X_train2.copy()
X_train3["international_plan"] = X_train2["international_plan"].map(dict(yes=1, no=0))
X_train3["voice_mail_plan"] = X_train2["voice_mail_plan"].map(dict(yes=1, no=0))
X_train3.head()
| state | account_length | area_code | international_plan | voice_mail_plan | number_vmail_messages | total_day_minutes | total_day_calls | total_day_charge | total_eve_minutes | total_eve_calls | total_eve_charge | total_night_minutes | total_night_calls | total_night_charge | total_intl_minutes | total_intl_calls | total_intl_charge | number_customer_service_calls | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4056 | MI | 62 | 408 | 0 | 0 | 0 | 119.3 | 94 | 20.28 | 224.0 | 81 | 19.04 | 156.7 | 78 | 7.05 | 12.1 | 6 | 3.27 | 5 |
| 4017 | CO | 97 | 415 | 0 | 1 | 27 | 125.9 | 90 | 21.40 | 123.4 | 91 | 10.49 | 230.0 | 71 | 10.35 | 7.3 | 7 | 1.97 | 0 |
| 2819 | VT | 60 | 415 | 0 | 0 | 0 | 193.9 | 118 | 32.96 | 85.0 | 110 | 7.23 | 210.1 | 134 | 9.45 | 13.2 | 8 | 3.56 | 3 |
| 1069 | MT | 74 | 415 | 0 | 0 | 0 | 162.7 | 102 | 27.66 | 292.0 | 105 | 24.82 | 183.3 | 80 | 8.25 | 8.7 | 6 | 2.35 | 0 |
| 344 | AZ | 117 | 408 | 0 | 0 | 0 | 239.9 | 84 | 40.78 | 174.8 | 106 | 14.86 | 209.5 | 93 | 9.43 | 9.8 | 2 | 2.65 | 0 |
Convert object columns to category in both the samples and labels dataframes.
for col in ['state', 'area_code', 'international_plan', 'voice_mail_plan']:
    X_train3[col] = X_train3[col].astype('category')
y_train = y_train.astype('category')
y_train3 = y_train.copy()
y_train3 = y_train3.map(dict(yes=1, no=0))
An initial visualization was built around the churn percentage in each state, the account length and the total charge. The dashboard, built in Tableau, is interactive: selecting a state in the tree map updates the histogram, map and box plot. It shows that 'no churn' clients are normally charged less than the others, which can be an important feature for evaluating whether a client is satisfied.
%%html
<div class='tableauPlaceholder' id='viz1661427519889' style='position: relative'><noscript><a href='#'><img alt='Dashboard 2 ' src='https://public.tableau.com/static/images/Pr/Project4_16562381181040/Dashboard2/1_rss.png' style='border: none' /></a></noscript><object class='tableauViz' style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='Project4_16562381181040/Dashboard2' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https://public.tableau.com/static/images/Pr/Project4_16562381181040/Dashboard2/1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='language' value='pt-BR' /><param name='filter' value='publish=yes' /></object></div> <script type='text/javascript'> var divElement = document.getElementById('viz1661427519889'); var vizElement = divElement.getElementsByTagName('object')[0]; if ( divElement.offsetWidth > 800 ) { vizElement.style.width='1366px';vizElement.style.height='795px';} else if ( divElement.offsetWidth > 500 ) { vizElement.style.width='1366px';vizElement.style.height='795px';} else { vizElement.style.width='100%';vizElement.style.height='1277px';} var scriptElement = document.createElement('script'); scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js'; vizElement.parentNode.insertBefore(scriptElement, vizElement); </script>
First we join samples and targets in the same dataframe in order to compare statistics between the clients identified as churn and those that are not. We also built an interactive dashboard in Tableau and bring its main insights into the notebook.
df = pd.concat([X_train3, y_train3], axis = 1)
df.head()
| state | account_length | area_code | international_plan | voice_mail_plan | number_vmail_messages | total_day_minutes | total_day_calls | total_day_charge | total_eve_minutes | total_eve_calls | total_eve_charge | total_night_minutes | total_night_calls | total_night_charge | total_intl_minutes | total_intl_calls | total_intl_charge | number_customer_service_calls | churn | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4056 | MI | 62 | 408 | 0 | 0 | 0 | 119.3 | 94 | 20.28 | 224.0 | 81 | 19.04 | 156.7 | 78 | 7.05 | 12.1 | 6 | 3.27 | 5 | 1 |
| 4017 | CO | 97 | 415 | 0 | 1 | 27 | 125.9 | 90 | 21.40 | 123.4 | 91 | 10.49 | 230.0 | 71 | 10.35 | 7.3 | 7 | 1.97 | 0 | 0 |
| 2819 | VT | 60 | 415 | 0 | 0 | 0 | 193.9 | 118 | 32.96 | 85.0 | 110 | 7.23 | 210.1 | 134 | 9.45 | 13.2 | 8 | 3.56 | 3 | 0 |
| 1069 | MT | 74 | 415 | 0 | 0 | 0 | 162.7 | 102 | 27.66 | 292.0 | 105 | 24.82 | 183.3 | 80 | 8.25 | 8.7 | 6 | 2.35 | 0 | 0 |
| 344 | AZ | 117 | 408 | 0 | 0 | 0 | 239.9 | 84 | 40.78 | 174.8 | 106 | 14.86 | 209.5 | 93 | 9.43 | 9.8 | 2 | 2.65 | 0 | 0 |
from plotly.subplots import make_subplots
import plotly.graph_objects as go
fig = make_subplots(rows=2, cols=1)
fig.add_trace(
go.Bar(x = df.state.value_counts().index,
y = df.state.value_counts().values,
marker_color = BLUE4,
opacity = 0.6,
name = ''),
row=1, col=1)
fig.add_trace(
go.Histogram(x=df.state.value_counts(),
marker_color = BLUE1,
opacity = 0.6,
name = ''),
row=2, col=1
)
# edit axis labels
fig['layout']['xaxis2']['title']='# Samples per state'
fig['layout']['yaxis']['title']='# Samples per state'
fig['layout']['yaxis2']['title']='State count'
# Edit the layout
fig.update_layout(template = 'simple_white',
font_family="Arial",
font_color= GRAY6,
title_font_family="Arial Bold",
title_font_size = 25,
title_font_color= GRAY3,
legend_title_font_color=GRAY6,
showlegend=False)
fig.update_layout(height=800, width=950, title_text="State Variable")
fig.show()
Most states contribute between 50 and 90 samples, with a median of around 65. CA is the state with the fewest samples (31), while WV has the most in the train data (115). Let's see whether the percentage of churn clients changes between states.
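The per-state figures quoted above come straight from `value_counts()`; a minimal sketch of the pattern on a synthetic state column (toy counts chosen to mirror the real extremes, not computed from the data):

```python
import pandas as pd

# Synthetic stand-in for df.state with toy counts
states = pd.Series(['OH'] * 70 + ['CA'] * 31 + ['WV'] * 115)

counts = states.value_counts()
print(f'fewest samples: {counts.idxmin()} ({counts.min()})')  # CA (31)
print(f'most samples:   {counts.idxmax()} ({counts.max()})')  # WV (115)
```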
df_vis = df.copy()
df_vis['churn'] = df.churn.astype('int')
churn_perc = round(df_vis.groupby('state').sum()['churn'] *100 / df_vis.groupby('state').count()['churn'],
2 ).sort_values(ascending = False)
pd.DataFrame(churn_perc.describe())
| churn | |
|---|---|
| count | 51.000000 |
| mean | 14.082157 |
| std | 5.194213 |
| min | 3.700000 |
| 25% | 10.440000 |
| 50% | 13.040000 |
| 75% | 17.435000 |
| max | 25.370000 |
stat, p = shapiro(churn_perc)
print('Statistics=%.3f, p=%.3f' % (stat, p))
# interpret
alpha = 0.05
if p > alpha:
print('Sample looks Gaussian (fail to reject H0)')
else:
print('Sample does not look Gaussian (reject H0)')
Statistics=0.973, p=0.291 Sample looks Gaussian (fail to reject H0)
fig = make_subplots(rows=2, cols=1)
fig.add_trace(
go.Bar(x = churn_perc.index,
y = churn_perc.values,
marker_color = BLUE4,
opacity = 0.6,
name = ''),
row=1, col=1)
fig.add_trace(
go.Histogram(x=churn_perc,
marker_color = BLUE1,
nbinsx = 15,
opacity = 0.6,
name = ''),
row=2, col=1
)
# edit axis labels
fig['layout']['xaxis2']['title']='percentage (%)'
fig['layout']['yaxis']['title']='percentage (%)'
fig['layout']['yaxis2']['title']='state count'
# Edit the layout
fig.update_layout(template = 'simple_white',
font_family="Arial",
font_color= GRAY6,
title_font_family="Arial Bold",
title_font_size = 25,
title_font_color= GRAY3,
legend_title_font_color=GRAY6,
showlegend=False)
fig.update_layout(height=800, width=950, title_text="% of churn per state")
fig.show()
print(f'The minimum time a client was with the company is {df.account_length.min()} month.')
print(f'The median time is {round(df.account_length.median()/12, 0)} years / {round(df.account_length.median(), 0)} months')
print(f'The maximum time a client was with the company is {round(df.account_length.max()/12, 0)} years')
The minimum time a client was with the company is 1 month. The median time is 8.0 years / 99.0 months The maximum time a client was with the company is 20.0 years
fig = go.Figure()
# Use x instead of y argument for horizontal plot
fig.add_trace(go.Box(x=df.account_length, name =''))
fig.update_layout(title = 'Account length (months)',
height=400, width=950,
template = 'simple_white',
font_family="Arial",
font_size = 18,
font_color= GRAY6,
title_font_family="Arial Bold",
title_font_size = 25,
title_font_color= GRAY3,
legend_title_font_color=GRAY6,
showlegend=False)
fig.show()
Let's compare the account length distribution between churn and no-churn clients.
group_labels = ['Yes', 'No']
colors = [GRAY9, BLUE2]
# Create distplot with curve_type set to 'normal'
fig = ff.create_distplot([df[df.churn == 1].account_length, df[df.churn == 0].account_length], group_labels,
bin_size= 25, curve_type='normal', # override default 'kde'
colors=colors, show_rug=False)
fig['layout']['xaxis']['title']='Account length (months)'
# Add title
fig.update_layout(height=400, width=950,
template = 'simple_white',
font_family="Arial",
font_size = 18,
font_color= GRAY6,
title_font_family="Arial Bold",
title_font_size = 25,
title_font_color= GRAY3,
legend_title_font_color=GRAY6,
legend_title_text='Churn',
showlegend=True)
fig.show()
The account length distribution shows no real difference between churn and no-churn clients: even a customer who has been with the telecom company for a long time may still churn.
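The visual impression can be backed with a nonparametric test; a minimal sketch using `scipy.stats.mannwhitneyu`, shown here on synthetic samples standing in for the churn / no-churn `account_length` series:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
# Synthetic stand-ins for the churn and no-churn account_length samples
churn_len = rng.normal(100, 40, 480).clip(1)
no_churn_len = rng.normal(100, 40, 2920).clip(1)

stat, p = mannwhitneyu(churn_len, no_churn_len)
print(f'U = {stat:.0f}, p = {p:.3f}')
if p > 0.05:
    print('no significant difference between the two distributions')
else:
    print('the two distributions differ significantly')
```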
Let's compare the share of clients with international and voice mail plans between the churn and no-churn groups.
# International Plan
new_index = ['No', 'Yes']
df_1_int = round(df[df.churn == 1]['international_plan'].value_counts() * 100 / df[df.churn == 1]['international_plan'].value_counts().sum() ,0)
df_1_int.index = new_index
df_0_int = round(df[df.churn == 0]['international_plan'].value_counts() * 100 / df[df.churn == 0]['international_plan'].value_counts().sum() ,0)
df_0_int.index = new_index
# Voice mail plan
df_1_voi = round(df[df.churn == 1]['voice_mail_plan'].value_counts() * 100 / df[df.churn == 1]['voice_mail_plan'].value_counts().sum() ,0)
df_1_voi.index = new_index
df_0_voi = round(df[df.churn == 0]['voice_mail_plan'].value_counts() * 100 / df[df.churn == 0]['voice_mail_plan'].value_counts().sum() ,0)
df_0_voi.index = new_index
fig = make_subplots(rows=2, cols=2,
                    subplot_titles=('No Churn - International Plan',
                                    'No Churn - Voice Mail Plan',
                                    'Churn - International Plan',
                                    'Churn - Voice Mail Plan'))
colors = [GRAY9, BLUE1]
fig.add_trace(
    go.Bar(x = df_0_int.index,
           y = df_0_int.values,
           marker_color=colors,
           opacity = 0.6,
           name = 'No Churn - International Plan'),
    row=1, col=1)
fig.add_trace(
    go.Bar(x = df_0_voi.index,
           y = df_0_voi.values,
           marker_color=colors,
           opacity = 0.6,
           name = 'No Churn - Voice Mail Plan'),
    row=1, col=2)
fig.add_trace(
    go.Bar(x = df_1_int.index,
           y = df_1_int.values,
           marker_color=colors,
           opacity = 0.6,
           name = 'Churn - International Plan'),
    row=2, col=1)
fig.add_trace(
    go.Bar(x = df_1_voi.index,
           y = df_1_voi.values,
           marker_color=colors,
           opacity = 0.6,
           name = 'Churn - Voice Mail Plan'),
    row=2, col=2)
# Edit the layout
fig.update_layout(template = 'simple_white',
font_family="Arial",
font_color= GRAY6,
title_font_family="Arial Bold",
title_font_size = 25,
title_font_color= GRAY3,
legend_title_font_color=GRAY6,
showlegend=False)
fig['layout']['yaxis']['title']='Percentage (%)'
fig['layout']['yaxis3']['title']='Percentage (%)'
fig.update_layout(height=600, width=950)
fig.update_layout(yaxis1 = dict(range=[0, 100]),
yaxis2 = dict(range=[0, 100]),
yaxis3 = dict(range=[0, 100]),
yaxis4 = dict(range=[0, 100]))
fig.show()
Among customers with a voice mail plan, the number of voice-mail messages follows a roughly normal distribution centered around 30 messages. There is no evident difference between churn and no-churn clients.
# Create distplot with curve_type set to 'normal'
fig = go.Figure()
df_plot = df[df.voice_mail_plan == 1]
fig.add_trace(
go.Histogram(x=df_plot[df_plot.churn == 1].number_vmail_messages,
histnorm = 'percent',
marker_color = BLUE1,
nbinsx = 15,
opacity = 0.4,
name = 'Yes'))
fig.add_trace(
go.Histogram(x=df_plot[df_plot.churn == 0].number_vmail_messages,
histnorm = 'percent',
marker_color = ORANGE1,
nbinsx = 15,
opacity = 0.4,
name = 'No'))
fig['layout']['xaxis']['title']='Number of voice mail messages'
fig['layout']['yaxis']['title']='Percentage (%)'
# Add title
fig.update_layout(height=400, width=950,
template = 'simple_white',
font_family="Arial",
font_size = 18,
font_color= GRAY6,
title_font_family="Arial Bold",
title_font_size = 25,
title_font_color= GRAY3,
legend_title_font_color=GRAY6,
legend_title_text='Churn',
showlegend=True)
median_churn = df_plot[df_plot.churn == 1].number_vmail_messages.median()
median_no_churn = df_plot[df_plot.churn == 0].number_vmail_messages.median()
fig.add_annotation(x=10, y=25,
                   text= 'Median Churn: ' + str(median_churn),
                   showarrow=False)
fig.add_annotation(x=10, y=21,
                   text= 'Median No Churn: ' + str(median_no_churn),
                   showarrow=False)
fig.update_layout(barmode='group', bargap=0.10,bargroupgap=0.0)
fig.show()
Let's analyze separately the total minutes, number of calls and charge for each period of the day (day, evening and night).
hist_data = [df.total_day_minutes.values,
df.total_eve_minutes.values,
df.total_night_minutes.values]
group_labels = ['day', 'evening', 'night']
colors = [BLUE3, ORANGE1, GREEN2]
fig = ff.create_distplot(hist_data, group_labels,
bin_size=10,
show_hist=False,
show_rug= False,
colors=colors)
fig.update_layout(height=400, width=950,
title_text='total minutes',
template = 'simple_white',
font_family="Arial",
font_size = 18,
font_color= GRAY6,
title_font_family="Arial Bold",
title_font_size = 25,
title_font_color= GRAY3,
legend_title_font_color=GRAY6,
legend_title_text='day time',
showlegend=True)
fig.show()
hist_data = [df.total_day_calls.values,
df.total_eve_calls.values,
df.total_night_calls.values]
group_labels = ['day', 'evening', 'night']
colors = [BLUE3, ORANGE1, GREEN2]
fig = ff.create_distplot(hist_data, group_labels,
bin_size=10,
show_hist=False,
show_curve= True,
show_rug= False,
colors=colors)
fig.update_layout(height=400, width=950,
title_text='total calls',
template = 'simple_white',
font_family="Arial",
font_size = 18,
font_color= GRAY6,
title_font_family="Arial Bold",
title_font_size = 25,
title_font_color= GRAY3,
legend_title_font_color=GRAY6,
legend_title_text='day time',
showlegend=True)
fig.show()
hist_data = [df.total_day_charge.values,
df.total_eve_charge.values,
df.total_night_charge.values]
group_labels = ['day', 'evening', 'night']
colors = [BLUE3, ORANGE1, GREEN2]
fig = ff.create_distplot(hist_data, group_labels,
bin_size=1,
show_hist=True,
show_curve= False,
show_rug= False,
colors=colors)
fig.update_layout(height=400, width=950,
title_text='total charge',
template = 'simple_white',
font_family="Arial",
font_size = 18,
font_color= GRAY6,
title_font_family="Arial Bold",
title_font_size = 25,
title_font_color= GRAY3,
legend_title_font_color=GRAY6,
legend_title_text='day time',
showlegend=True)
fig.add_annotation(x=31, y=0.06,
text= str(round(df.total_day_charge.values.mean(),0 )),
showarrow=False)
fig.add_annotation(x=17, y=0.11,
text= str(round(df.total_eve_charge.values.mean(),0 )),
showarrow=False)
fig.add_annotation(x=9, y=0.18,
text= str(round(df.total_night_charge.values.mean(),0 )),
showarrow=False)
fig.show()
Although the numbers of calls and total minutes are similar across periods, the total charge differs markedly: daytime calls can cost around 3.5x more than night calls.
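A quick sanity check of the ~3.5x figure, using rough mean charges and minutes read off the plots above (approximate, hypothetical values, not recomputed from the dataset):

```python
# Approximate period means eyeballed from the plots (assumptions, not exact values)
day_charge, day_minutes = 31.0, 180.0
night_charge, night_minutes = 9.0, 200.0

day_rate = day_charge / day_minutes        # implied $/minute during the day
night_rate = night_charge / night_minutes  # implied $/minute at night
ratio = day_rate / night_rate
print(f'day: {day_rate:.3f} $/min, night: {night_rate:.3f} $/min, ratio: {ratio:.1f}x')
```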
fig = go.Figure()
fig.add_trace(
go.Scatter(x=df.total_day_minutes,
y = df.total_day_charge,
mode = 'markers',
marker_color = BLUE3,
opacity= 0.6,
name = 'day'))
fig.add_trace(
go.Scatter(x=df.total_eve_minutes,
y = df.total_eve_charge,
mode = 'markers',
marker_color = ORANGE1,
opacity= 0.6,
name = 'evening'))
fig.add_trace(
go.Scatter(x=df.total_night_minutes,
y = df.total_night_charge,
mode = 'markers',
marker_color = GREEN2,
opacity= 0.6,
name = 'night'))
fig['layout']['xaxis']['title']='Total amount of minutes'
fig['layout']['yaxis']['title']='Total charge'
fig.update_layout(height=400, width=900,
template = 'simple_white',
font_family="Arial",
font_size = 18,
font_color= GRAY6,
title_font_family="Arial Bold",
title_font_size = 25,
title_font_color= GRAY3,
legend_title_font_color=GRAY6,
legend_title_font_size = 18,
legend_title_text='day time',
showlegend=True)
fig.show()
The charge is directly proportional to the number of minutes used in each period of the day.
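Proportionality means the charge-vs-minutes points lie on a straight line through the origin; a self-contained sketch on synthetic data (hypothetical rate of 0.17 $/min) showing how the implied per-minute rate can be recovered as the slope of a linear fit:

```python
import numpy as np

rate = 0.17                         # hypothetical $/minute
minutes = np.linspace(10, 400, 50)  # synthetic stand-in for total_day_minutes
charge = rate * minutes             # perfectly proportional charge

# Degree-1 fit: charge ≈ slope * minutes + intercept
slope, intercept = np.polyfit(minutes, charge, 1)
print(f'slope: {slope:.4f} $/min, intercept: {intercept:.6f}')
```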
Let's sum the minutes, charge and number of calls over the entire day.
df['total_minutes'] = df.total_day_minutes + df.total_eve_minutes + df.total_night_minutes
df['total_charge'] = df.total_day_charge + df.total_eve_charge + df.total_night_charge
df['total_calls'] = df.total_day_calls + df.total_eve_calls + df.total_night_calls
df['charge_per_minute'] = df['total_charge'] / df['total_minutes']
df['charge_per_call'] = df['total_charge'] / df['total_calls']
df.head()
| state | account_length | area_code | international_plan | voice_mail_plan | number_vmail_messages | total_day_minutes | total_day_calls | total_day_charge | total_eve_minutes | ... | total_intl_minutes | total_intl_calls | total_intl_charge | number_customer_service_calls | churn | total_minutes | total_charge | total_calls | charge_per_minute | charge_per_call | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4056 | MI | 62 | 408 | 0 | 0 | 0 | 119.3 | 94 | 20.28 | 224.0 | ... | 12.1 | 6 | 3.27 | 5 | 1 | 500.0 | 46.37 | 253 | 0.092740 | 0.183281 |
| 4017 | CO | 97 | 415 | 0 | 1 | 27 | 125.9 | 90 | 21.40 | 123.4 | ... | 7.3 | 7 | 1.97 | 0 | 0 | 479.3 | 42.24 | 252 | 0.088129 | 0.167619 |
| 2819 | VT | 60 | 415 | 0 | 0 | 0 | 193.9 | 118 | 32.96 | 85.0 | ... | 13.2 | 8 | 3.56 | 3 | 0 | 489.0 | 49.64 | 362 | 0.101513 | 0.137127 |
| 1069 | MT | 74 | 415 | 0 | 0 | 0 | 162.7 | 102 | 27.66 | 292.0 | ... | 8.7 | 6 | 2.35 | 0 | 0 | 638.0 | 60.73 | 287 | 0.095188 | 0.211603 |
| 344 | AZ | 117 | 408 | 0 | 0 | 0 | 239.9 | 84 | 40.78 | 174.8 | ... | 9.8 | 2 | 2.65 | 0 | 0 | 624.2 | 65.07 | 283 | 0.104245 | 0.229929 |
5 rows × 25 columns
fig = make_subplots(rows=1, cols=3)
# Total minutes
fig.add_trace(go.Box(x=df[df.churn == 1].total_minutes,
marker_color = ORANGE1,
name = 'Yes'),
row = 1, col = 1)
fig.add_trace(go.Box(x=df[df.churn == 0].total_minutes,
marker_color = GRAY8,
name = 'No'),
row = 1, col = 1)
# Total charge
fig.add_trace(go.Box(x=df[df.churn == 1].total_charge,
marker_color = ORANGE1,
name = ''),
row = 1, col = 2)
fig.add_trace(go.Box(x=df[df.churn == 0].total_charge,
marker_color = GRAY8,
name = ' '),
row = 1, col = 2)
# Total calls
fig.add_trace(go.Box(x=df[df.churn == 1].total_calls,
marker_color = ORANGE1,
name = ''),
row = 1, col = 3)
fig.add_trace(go.Box(x=df[df.churn == 0].total_calls,
marker_color = GRAY8,
name = ' '),
row = 1, col = 3)
fig['layout']['xaxis1']['title']='total minutes'
fig['layout']['xaxis2']['title']='total charge'
fig['layout']['xaxis3']['title']='total calls'
fig['layout']['yaxis1']['title']='Churn'
fig.update_layout(height=400, width=950,
template = 'simple_white',
font_family="Arial",
font_size = 18,
font_color= GRAY6,
title_font_family="Arial Bold",
title_font_size = 25,
title_font_color= GRAY3,
legend_title_font_color=GRAY6,
showlegend=False)
fig.show()
As expected, the number of customer service calls differs significantly between satisfied and unsatisfied clients: unsatisfied clients make more calls to customer service. From the distribution plot below, satisfied clients call a median of 1 time, while churn clients call a median of 2 times.
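The medians quoted above come from a simple groupby; a sketch of the pattern on toy data (synthetic values, reusing the `churn` / `number_customer_service_calls` column names from this notebook):

```python
import pandas as pd

toy = pd.DataFrame({
    'churn':                         [0, 0, 0, 0, 1, 1, 1],
    'number_customer_service_calls': [1, 1, 2, 0, 2, 3, 2],
})

medians = toy.groupby('churn')['number_customer_service_calls'].median()
print(medians)  # churn 0 -> 1.0, churn 1 -> 2.0
```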
group_labels = ['Yes', 'No']
colors = [ ORANGE1, GRAY9]
# Create distplot with curve_type set to 'normal'
fig = ff.create_distplot([df[df.churn == 1].number_customer_service_calls,
df[df.churn == 0].number_customer_service_calls], group_labels,
#bin_size= 25,
curve_type='normal', # override default 'kde'
show_hist = False,
colors=colors,
show_rug=False)
fig['layout']['xaxis']['title']='# customer service calls'
# Add title
fig.update_layout(height=400, width=950,
template = 'simple_white',
font_family="Arial",
font_size = 18,
font_color= GRAY6,
title_font_family="Arial Bold",
title_font_size = 25,
title_font_color= GRAY3,
legend_title_font_color=GRAY6,
legend_title_text='Churn',
showlegend=True)
fig.show()
fig = go.Figure()
# Total minutes
fig.add_trace(go.Box(x=df[df.churn == 1].number_customer_service_calls,
marker_color = ORANGE1,
name = 'Yes'))
fig.add_trace(go.Box(x=df[df.churn == 0].number_customer_service_calls,
marker_color = GRAY8,
name = 'No'))
fig['layout']['xaxis1']['title']='# customer service calls'
fig['layout']['yaxis1']['title']='Churn'
fig.update_layout(height=400, width=950,
template = 'simple_white',
font_family="Arial",
font_size = 18,
font_color= GRAY6,
title_font_family="Arial Bold",
title_font_size = 25,
title_font_color= GRAY3,
legend_title_font_color=GRAY6,
showlegend=False)
y_train3.value_counts()
0 2920 1 480 Name: churn, dtype: int64
The churn variable we want to predict is unbalanced. Because we have a small amount of data, we will upsample the minority class ('yes'/1).
# Get categorical column indices for SMOTENC
cols_category = X_train3.select_dtypes(include=['category']).columns
col_idx = [ind for ind, x in enumerate(X_train3.columns) if x in cols_category]
X_res, y_res = SMOTENC(categorical_features = col_idx,
                       n_jobs = -1).fit_resample(X_train3, y_train3)
y_res.value_counts()
0 2920 1 2920 Name: churn, dtype: int64
The train data is now balanced, with a total of 5840 rows.
Let's take a closer look at the final dataset to understand how to perform feature engineering that extracts the most value from the data. The first thing we can improve is the area_code variable, a categorical variable with only 3 distinct values (415, 408, 510). Let's create dummy variables.
X_res_dummy = pd.get_dummies(X_res, columns= [ 'area_code', 'state'], drop_first= True)
X_res_dummy.head()
| account_length | international_plan | voice_mail_plan | number_vmail_messages | total_day_minutes | total_day_calls | total_day_charge | total_eve_minutes | total_eve_calls | total_eve_charge | ... | state_SD | state_TN | state_TX | state_UT | state_VA | state_VT | state_WA | state_WI | state_WV | state_WY | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 62 | 0 | 0 | 0 | 119.3 | 94 | 20.28 | 224.0 | 81 | 19.04 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 97 | 0 | 1 | 27 | 125.9 | 90 | 21.40 | 123.4 | 91 | 10.49 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 60 | 0 | 0 | 0 | 193.9 | 118 | 32.96 | 85.0 | 110 | 7.23 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 74 | 0 | 0 | 0 | 162.7 | 102 | 27.66 | 292.0 | 105 | 24.82 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 117 | 0 | 0 | 0 | 239.9 | 84 | 40.78 | 174.8 | 106 | 14.86 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 69 columns
We will build several features to support the machine learning models: detailed information on the cost, time and number of national and overall calls.
X_res_fe = X_res_dummy.copy()
# National Calls features
X_res_fe['total_national_calls'] = X_res_dummy.total_day_calls + X_res_dummy.total_eve_calls + X_res_dummy.total_night_calls
X_res_fe['total_national_charge'] = X_res_dummy.total_day_charge + X_res_dummy.total_eve_charge + X_res_dummy.total_night_charge
X_res_fe['total_national_minutes'] = X_res_dummy.total_day_minutes + X_res_dummy.total_eve_minutes + X_res_dummy.total_night_minutes
X_res_fe['national_cost_by_calls'] = X_res_fe['total_national_charge']/X_res_fe['total_national_calls']
X_res_fe['national_cost_by_minutes'] = X_res_fe['total_national_charge']/X_res_fe['total_national_minutes']
# Total Calls features
X_res_fe['total_calls'] = X_res_dummy.total_day_calls + X_res_dummy.total_eve_calls + X_res_dummy.total_night_calls + X_res_dummy.total_intl_calls
X_res_fe['total_charge'] = X_res_dummy.total_day_charge + X_res_dummy.total_eve_charge + X_res_dummy.total_night_charge + X_res_dummy.total_intl_charge
X_res_fe['total_minutes'] = X_res_dummy.total_day_minutes + X_res_dummy.total_eve_minutes + X_res_dummy.total_night_minutes + X_res_dummy.total_intl_minutes
X_res_fe['cost_by_calls'] = X_res_fe['total_charge']/X_res_fe['total_calls']
X_res_fe['cost_by_minutes'] = X_res_fe['total_charge']/X_res_fe['total_minutes']
X_res_fe.head()
| account_length | international_plan | voice_mail_plan | number_vmail_messages | total_day_minutes | total_day_calls | total_day_charge | total_eve_minutes | total_eve_calls | total_eve_charge | ... | total_national_calls | total_national_charge | total_national_minutes | national_cost_by_calls | national_cost_by_minutes | total_calls | total_charge | total_minutes | cost_by_calls | cost_by_minutes | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 62 | 0 | 0 | 0 | 119.3 | 94 | 20.28 | 224.0 | 81 | 19.04 | ... | 253 | 46.37 | 500.0 | 0.183281 | 0.092740 | 259 | 49.64 | 512.1 | 0.191660 | 0.096934 |
| 1 | 97 | 0 | 1 | 27 | 125.9 | 90 | 21.40 | 123.4 | 91 | 10.49 | ... | 252 | 42.24 | 479.3 | 0.167619 | 0.088129 | 259 | 44.21 | 486.6 | 0.170695 | 0.090855 |
| 2 | 60 | 0 | 0 | 0 | 193.9 | 118 | 32.96 | 85.0 | 110 | 7.23 | ... | 362 | 49.64 | 489.0 | 0.137127 | 0.101513 | 370 | 53.20 | 502.2 | 0.143784 | 0.105934 |
| 3 | 74 | 0 | 0 | 0 | 162.7 | 102 | 27.66 | 292.0 | 105 | 24.82 | ... | 287 | 60.73 | 638.0 | 0.211603 | 0.095188 | 293 | 63.08 | 646.7 | 0.215290 | 0.097541 |
| 4 | 117 | 0 | 0 | 0 | 239.9 | 84 | 40.78 | 174.8 | 106 | 14.86 | ... | 283 | 65.07 | 624.2 | 0.229929 | 0.104245 | 285 | 67.72 | 634.0 | 0.237614 | 0.106814 |
5 rows × 79 columns
Most common machine learning models require pre-processing of the data, in particular normalization. SVM, KNN and other distance-based models compare instances by their distance, so features on very different scales can dominate the metric and slow down model convergence.
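As a minimal illustration of what `StandardScaler` does (on toy data, not from this dataset), each column is rescaled to zero mean and unit variance, so features originally on very different scales contribute comparably to distance computations:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: two features on very different scales
X_toy = np.array([[1.0, 1000.0],
                  [2.0, 2000.0],
                  [3.0, 3000.0],
                  [4.0, 4000.0]])

scaled = StandardScaler().fit_transform(X_toy)

# After scaling, every column has zero mean and unit variance
print(scaled.mean(axis=0))  # ~[0. 0.]
print(scaled.std(axis=0))   # ~[1. 1.]
```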
# All columns
col_names = X_res_fe.columns
# Numerical features to normalize
col_to_normalize= ['account_length', 'number_vmail_messages',
'total_day_minutes', 'total_day_calls', 'total_day_charge',
'total_eve_minutes', 'total_eve_calls', 'total_eve_charge',
'total_night_minutes', 'total_night_calls', 'total_night_charge',
'total_intl_minutes', 'total_intl_calls', 'total_intl_charge',
'number_customer_service_calls',
'total_national_calls', 'total_national_charge', 'total_national_minutes', 'national_cost_by_calls', 'national_cost_by_minutes',
'total_calls', 'total_charge', 'total_minutes', 'cost_by_calls', 'cost_by_minutes']
# Non numerical features
col_not_to_normalize = [name for name in col_names if name not in col_to_normalize]
transf = make_column_transformer(
(StandardScaler(), col_to_normalize), remainder = 'passthrough')
X_final = pd.DataFrame(transf.fit_transform(X_res_fe), columns= col_to_normalize + col_not_to_normalize)
To predict churn we will test the different classification models described below.
Prediction performance will be evaluated with the accuracy metric, as specified in the Kaggle challenge.
$ Accuracy = \frac{\text{number of correct predictions}}{\text{number of samples}} $
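The metric is simple enough to compute by hand, which makes the definition concrete; a quick check on toy labels that the manual fraction matches `sklearn.metrics.accuracy_score`:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy labels: 8 samples, 6 of them predicted correctly
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 0])

# Accuracy = number of correct predictions / number of samples
manual = (y_true == y_pred).mean()
print(manual)                          # 0.75
print(accuracy_score(y_true, y_pred))  # 0.75
```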
#models
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
import xgboost as xgb
from xgboost import plot_importance
from sklearn.tree import export_graphviz
import lightgbm as lgb
# metrics
from sklearn.metrics import confusion_matrix, accuracy_score
# Model Tuning
from sklearn.model_selection import RandomizedSearchCV
First we build a function that applies the same feature engineering developed for the training data to the validation and test datasets.
def df_feat_eng(df_input):
raw_data = df_input.copy()
raw_data['area_code'] = raw_data['area_code'].apply(lambda x: x.split('_')[-1])
raw_data["international_plan"] = raw_data["international_plan"].map(dict(yes=1, no=0))
raw_data["voice_mail_plan"] = raw_data["voice_mail_plan"].map(dict(yes=1, no=0))
for col in ['state', 'area_code', 'international_plan', 'voice_mail_plan']:
raw_data[col] = raw_data[col].astype('category')
raw_data['total_minutes'] = raw_data.total_day_minutes + raw_data.total_eve_minutes + raw_data.total_night_minutes
raw_data['total_charge'] = raw_data.total_day_charge + raw_data.total_eve_charge + raw_data.total_night_charge
raw_data['total_calls'] = raw_data.total_day_calls + raw_data.total_eve_calls + raw_data.total_night_calls
raw_data['charge_per_minute'] = raw_data['total_charge'] / raw_data['total_minutes']
raw_data['charge_per_call'] = raw_data['total_charge'] / raw_data['total_calls']
X_raw= pd.get_dummies(raw_data, columns= [ 'area_code', 'state'], drop_first= True)
# National Calls features
X_raw['total_national_calls'] = X_raw.total_day_calls + X_raw.total_eve_calls + X_raw.total_night_calls
X_raw['total_national_charge'] = X_raw.total_day_charge + X_raw.total_eve_charge + X_raw.total_night_charge
X_raw['total_national_minutes'] = X_raw.total_day_minutes + X_raw.total_eve_minutes + X_raw.total_night_minutes
X_raw['national_cost_by_calls'] = X_raw['total_national_charge']/X_raw['total_national_calls']
X_raw['national_cost_by_minutes'] = X_raw['total_national_charge']/X_raw['total_national_minutes']
# Total Calls features
X_raw['total_calls'] = X_raw.total_day_calls + X_raw.total_eve_calls + X_raw.total_night_calls + X_raw.total_intl_calls
X_raw['total_charge'] = X_raw.total_day_charge + X_raw.total_eve_charge + X_raw.total_night_charge + X_raw.total_intl_charge
X_raw['total_minutes'] = X_raw.total_day_minutes + X_raw.total_eve_minutes + X_raw.total_night_minutes + X_raw.total_intl_minutes
X_raw['cost_by_calls'] = X_raw['total_charge']/X_raw['total_calls']
X_raw['cost_by_minutes'] = X_raw['total_charge']/X_raw['total_minutes']
X_fe_final = pd.DataFrame(transf.transform(X_raw), columns= col_to_normalize + col_not_to_normalize)
return X_fe_final
X_test_final = df_feat_eng(X_test)
# QC on columns number
X_test_final.shape[1] == X_final.shape[1]
True
# QC columns name
sum(X_test_final.columns == X_final.columns) == 79
True
# QC on columns type
sum(X_test_final.dtypes == X_final.dtypes) == X_final.shape[1]
True
# Change labels on test and validation to 0 and 1
y_test_res = y_test.map(dict(yes=1, no=0))
y_val_res = y_val.map(dict(yes=1, no=0))
svc = SVC(probability=True)
svc.fit(X_final, y_res)
SVC(probability=True)
y_pred = svc.predict(X_final)
pd.DataFrame(confusion_matrix(y_res, y_pred),
columns = ['Pred 0', 'Pred 1'],
index = ['True 0', 'True 1'])
| Pred 0 | Pred 1 | |
|---|---|---|
| True 0 | 2771 | 149 |
| True 1 | 322 | 2598 |
# Score on train data
accuracy_score(y_res, y_pred)
0.9193493150684932
# Score on test data
accuracy_score(y_test_res, svc.predict(X_test_final))
0.8941176470588236
The model overfits the training data (92% train vs 89% test accuracy). To reduce this, we retrain the SVC with stronger regularization (a smaller C):
svc2 = SVC(probability=True,
C = 0.1,
kernel = 'rbf')
svc2.fit(X_final, y_res)
svc_train_score = round(accuracy_score(y_res , svc2.predict(X_final )), 2)
svc_test_score = round(accuracy_score(y_test_res, svc2.predict(X_test_final)), 2)
print(f'Train score: {svc_train_score}')
print(f'Test score: {svc_test_score}')
Train score: 0.84 Test score: 0.88
# Define model
logr = LogisticRegression(solver = 'lbfgs',
max_iter= 1000)
# Train model
logr.fit(X_final, y_res)
# Predict with train data
y_pred = logr.predict(X_final)
# Confusion matrix
pd.DataFrame(confusion_matrix(y_res, y_pred),
columns = ['Pred 0', 'Pred 1'],
index = ['True 0', 'True 1'])
| Pred 0 | Pred 1 | |
|---|---|---|
| True 0 | 2293 | 627 |
| True 1 | 626 | 2294 |
# Score on train data
accuracy_score(y_res, y_pred)
0.7854452054794521
Logistic regression performs worse than the SVM; we will nevertheless compute its train and test scores for the final comparison table, and use regularization to mitigate any overfitting that may appear during training.
logr2 = LogisticRegression(solver = 'lbfgs',
C = 0.5,
max_iter= 1000)
logr2.fit(X_final, y_res)
logr_train_score = round(accuracy_score(y_res , logr2.predict(X_final )), 2)
logr_test_score = round(accuracy_score(y_test_res, logr2.predict(X_test_final)), 2)
print(f'Train score: {logr_train_score}')
print(f'Test score: {logr_test_score}')
Train score: 0.78 Test score: 0.78
# Define model
gnb = GaussianNB()
# Train model
gnb.fit(X_final, y_res)
# Predict with train data
y_pred = gnb.predict(X_final)
# Confusion matrix
pd.DataFrame(confusion_matrix(y_res, y_pred),
columns = ['Pred 0', 'Pred 1'],
index = ['True 0', 'True 1'])
| Pred 0 | Pred 1 | |
|---|---|---|
| True 0 | 1877 | 1043 |
| True 1 | 818 | 2102 |
# Score on train data
accuracy_score(y_res, y_pred)
0.6813356164383562
gnb2 = GaussianNB()
gnb2.fit(X_final, y_res)
gnb_train_score = round(accuracy_score(y_res , gnb2.predict(X_final )), 2)
gnb_test_score = round(accuracy_score(y_test_res, gnb2.predict(X_test_final)), 2)
print(f'Train score: {gnb_train_score}')
print(f'Test score: {gnb_test_score}')
Train score: 0.68 Test score: 0.62
# Define model
rf = RandomForestClassifier(max_depth= 10)
# Train model
rf.fit(X_final, y_res)
# Predict with train data
y_pred = rf.predict(X_final)
# Confusion matrix
pd.DataFrame(confusion_matrix(y_res, y_pred),
columns = ['Pred 0', 'Pred 1'],
index = ['True 0', 'True 1'])
| Pred 0 | Pred 1 | |
|---|---|---|
| True 0 | 2906 | 14 |
| True 1 | 389 | 2531 |
# Score on train data
accuracy_score(y_res, y_pred)
0.9309931506849315
Let's evaluate the overfitting and the test score.
rf2 = RandomForestClassifier(max_depth= 10)
rf2.fit(X_final, y_res)
rf_train_score = round(accuracy_score(y_res , rf2.predict(X_final )), 2)
rf_test_score = round(accuracy_score(y_test_res, rf2.predict(X_test_final)), 2)
print(f'Train score: {rf_train_score}')
print(f'Test score: {rf_test_score}')
Train score: 0.94 Test score: 0.94
# Define model
extra = ExtraTreesClassifier()
# Train model
extra.fit(X_final, y_res)
# Predict with train data
y_pred = extra.predict(X_final)
# Confusion matrix
pd.DataFrame(confusion_matrix(y_res, y_pred),
columns = ['Pred 0', 'Pred 1'],
index = ['True 0', 'True 1'])
| Pred 0 | Pred 1 | |
|---|---|---|
| True 0 | 2920 | 0 |
| True 1 | 0 | 2920 |
# Score on train data
accuracy_score(y_res, y_pred)
1.0
extra2 = ExtraTreesClassifier(max_depth=11)
extra2.fit(X_final, y_res)
extra_train_score = round(accuracy_score(y_res , extra2.predict(X_final )), 2)
extra_test_score = round(accuracy_score(y_test_res, extra2.predict(X_test_final)), 2)
print(f'Train score: {extra_train_score}')
print(f'Test score: {extra_test_score}')
Train score: 0.9 Test score: 0.92
# Supported tree methods are `gpu_hist`, `approx`, and `hist`.
boost = xgb.XGBClassifier(
tree_method="gpu_hist",
objective = 'binary:logistic', #binary:logitraw
eval_metric = 'auc',
sampling_method = 'gradient_based',
n_jobs = -1,
# Default Parameters
learning_rate = 0.3,
max_depth = 6,
colsample_bytree =1,
subsample = 1,
min_split_loss = 0,
min_child_weight = 1
)
boost.fit(X_final, y_res)
# Predict with train data
y_pred = boost.predict(X_final)
# Confusion matrix
pd.DataFrame(confusion_matrix(y_res, y_pred),
columns = ['Pred 0', 'Pred 1'],
index = ['True 0', 'True 1'])
| Pred 0 | Pred 1 | |
|---|---|---|
| True 0 | 2920 | 0 |
| True 1 | 1 | 2919 |
# Score on train data
accuracy_score(y_res, y_pred)
0.9998287671232877
boost2 = xgb.XGBClassifier(tree_method="gpu_hist",
objective = 'binary:logistic', #binary:logitraw
eval_metric = 'auc',
sampling_method = 'uniform', #'gradient_based'
n_jobs = -1,
# Default Parameters
learning_rate = 0.15,
max_depth = 5,
colsample_bytree =1,
subsample = 1,
min_split_loss = 1,
min_child_weight = 1
)
boost2.fit(X_final, y_res)
xgb_train_score = round(accuracy_score(y_res , boost2.predict(X_final )), 2)
xgb_test_score = round(accuracy_score(y_test_res, boost2.predict(X_test_final)), 2)
print(f'Train score: {xgb_train_score}')
print(f'Test score: {xgb_test_score}')
Train score: 0.97 Test score: 0.97
data = [{'train score': svc_train_score, 'test score': svc_test_score},
{'train score': logr_train_score, 'test score': logr_test_score},
{'train score': gnb_train_score, 'test score': gnb_test_score},
{'train score': rf_train_score, 'test score': rf_test_score},
{'train score': extra_train_score, 'test score': extra_test_score},
{'train score': xgb_train_score, 'test score': xgb_test_score}]
df_resume = pd.DataFrame(data, index = ['SVM',
'Logistic Regr.',
'Naive Bayes',
'Random Forest',
'ExtraTrees',
'XGBoost'])
df_resume
| train score | test score | |
|---|---|---|
| SVM | 0.84 | 0.88 |
| Logistic Regr. | 0.78 | 0.78 |
| Naive Bayes | 0.68 | 0.62 |
| Random Forest | 0.94 | 0.94 |
| ExtraTrees | 0.90 | 0.92 |
| XGBoost | 0.97 | 0.97 |
The XGBoost model achieved the best score on both train and test data, with only slight overfitting that can be reduced during model tuning. XGBoost also has GPU support, which makes training much faster and allows testing more parameter combinations during tuning. For both of these reasons, we will continue with XGBoost only.
fig, ax = plt.subplots(1,1,figsize=(12,10))
plot_importance(boost2, ax=ax)
plt.tight_layout()
plt.show()
The feature importance from the XGBoost model shows that the total evening and international call minutes are the two most important features for deciding whether a customer churns. Other features to consider are account length and total charge.
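The ranked importances can also be read programmatically rather than only plotted. A self-contained sketch on synthetic data, using scikit-learn's `RandomForestClassifier` as a stand-in for the fitted XGBoost model (the notebook's `boost2` exposes the analogous `feature_importances_` attribute, paired with `X_final.columns`):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; with the notebook's model you would use
# boost2.feature_importances_ and X_final.columns instead
X_demo, y_demo = make_classification(n_samples=200, n_features=5,
                                     n_informative=3, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_demo, y_demo)

# Rank features by importance, highest first
importances = pd.Series(model.feature_importances_,
                        index=[f'feat_{i}' for i in range(5)])
print(importances.sort_values(ascending=False))
```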
Because cross-validation itself splits the data into train and test folds, we will join our train and test sets into a single dataframe and repeat the feature engineering.
df_tuned = pd.concat([X_train, X_test])
y_tuned = pd.concat([y_train, y_test])
print(X_train.shape[0]+X_test.shape[0] == df_tuned.shape[0])
print(y_train.shape[0]+y_test.shape[0] == y_tuned.shape[0])
True True
df_tuned_fe = df_feat_eng(df_tuned)
y_res = y_tuned.map(dict(yes=1, no=0))
xgb_estimator = xgb.XGBClassifier(tree_method="gpu_hist",
objective = 'binary:logistic',
eval_metric = 'auc',
sampling_method = 'gradient_based',
n_jobs = -1
)
parameters = {
'eta': [0.001, 0.01, 0.03, 0.05, 0.075, 0.1, 0.3],
'max_depth': range(2, 7, 1),
'alpha': np.arange(0, 1, 0.1),
'lambda': np.arange(0, 1, 0.1)
}
%%time
xgb_grid = RandomizedSearchCV(
estimator=xgb_estimator,
param_distributions=parameters,
n_iter = 1500,
scoring = 'accuracy',
n_jobs = -1,
cv = 8,
verbose = 3
)
xgb_grid.fit(df_tuned_fe, y_res)
Fitting 8 folds for each of 1500 candidates, totalling 12000 fits CPU times: total: 55.1 s Wall time: 1h 23min 50s
RandomizedSearchCV(cv=8,
estimator=XGBClassifier(base_score=None, booster=None,
callbacks=None,
colsample_bylevel=None,
colsample_bynode=None,
colsample_bytree=None,
early_stopping_rounds=None,
enable_categorical=False,
eval_metric='auc', gamma=None,
gpu_id=None, grow_policy=None,
importance_type=None,
interaction_constraints=None,
learning_rate=None, max_bin=None...
n_estimators=100, n_jobs=-1,
num_parallel_tree=None,
predictor=None, random_state=None,
reg_alpha=None, reg_lambda=None, ...),
n_iter=1500, n_jobs=-1,
param_distributions={'alpha': array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]),
'eta': [0.001, 0.01, 0.03, 0.05, 0.075,
0.1, 0.3],
'lambda': array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]),
'max_depth': range(2, 7)},
                   scoring='accuracy', verbose=3)
#save model
import pickle
pkl_filename = 'final_model.pkl'
with open(pkl_filename, 'wb') as file:
pickle.dump(xgb_grid, file)
#load model
with open(pkl_filename, 'rb') as file:
xgb_grid = pickle.load(file)
print(f'Best XGBoost Score: {round(xgb_grid.best_score_,3 )}')
print(f'Best XGBoost Parameters: {xgb_grid.best_params_}')
Best XGBoost Score: 0.982
Best XGBoost Parameters: {'max_depth': 3, 'lambda': 0.30000000000000004, 'eta': 0.3, 'alpha': 0.1}
The cross-validation identified the best parameters listed above.
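Beyond `best_params_`, a fitted search object keeps the full results table in `cv_results_`, which is useful for inspecting how close the runner-up candidates were. A small self-contained sketch (a logistic-regression search on synthetic data, standing in for the XGBoost search above):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X_demo, y_demo = make_classification(n_samples=300, random_state=0)

search = RandomizedSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_distributions={'C': [0.01, 0.1, 1.0, 10.0]},
    n_iter=4, cv=3, scoring='accuracy', random_state=0)
search.fit(X_demo, y_demo)

# cv_results_ holds per-candidate mean/std scores; sort by rank
results = pd.DataFrame(search.cv_results_)
cols = ['param_C', 'mean_test_score', 'std_test_score', 'rank_test_score']
print(results[cols].sort_values('rank_test_score'))
```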
Lets evaluate the final model's score using the validation data set that was not used until now.
X_val_final = df_feat_eng(X_val)
accuracy_score(y_val_res, xgb_grid.predict(X_val_final))
0.9670588235294117
The final score on the validation dataset was 0.967, which is close to the cross-validation score obtained during training. Before predicting on the final dataset to be submitted to Kaggle, we will retrain the model with the best parameters and with all the data, to include as much information as possible.
# Perform feature engineering on all data
X_final_df = df_feat_eng(X)
y_final = y.map(dict(yes=1, no=0))
# Train model
xgb_best = xgb.XGBClassifier(tree_method="gpu_hist",
objective = 'binary:logistic',
eval_metric = 'auc',
sampling_method = 'gradient_based',
n_jobs = -1,
eta = 0.1,
max_depth = 3,
reg_lambda = 0.8,
alpha = 0.7
)
xgb_best.fit(X_final_df, y_final)
XGBClassifier(alpha=0.7, base_score=0.5, booster='gbtree', callbacks=None,
colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
early_stopping_rounds=None, enable_categorical=False, eta=0.1,
eval_metric='auc', gamma=0, gpu_id=0, grow_policy='depthwise',
importance_type=None, interaction_constraints='',
learning_rate=0.100000001, max_bin=256, max_cat_to_onehot=4,
max_delta_step=0, max_depth=3, max_leaves=0, min_child_weight=1,
missing=nan, monotone_constraints='()', n_estimators=100,
              n_jobs=-1, num_parallel_tree=1, predictor='auto', random_state=0, ...)
# Import new dataset
new_data = pd.read_csv('data/test.csv', delimiter=',')
id_col = new_data.id
new_data = new_data.drop('id', axis = 1)
# Feature Engineering and predict
new_data_fe = df_feat_eng(new_data)
churn = xgb_best.predict(new_data_fe)
churn2 = ['yes' if x == 1 else 'no' for x in churn]
export_df = pd.DataFrame({'id': id_col, 'churn': churn2})
export_df.to_csv('results_2022_08_21_v1.csv', sep = ',', index = False)
Final Score:
We were able to identify probable churn clients with 98% accuracy. Since 85% of the customers in the training set were "no churn", a trivial majority-class predictor would already reach 85% accuracy, so the final model clearly improves on that baseline.